De novo diploid genome assembly using long noisy reads

The high sequencing error rate has impeded the application of long noisy reads for diploid genome assembly. Most existing assemblers failed to generate high-quality phased assemblies using long noisy reads. Here, we present PECAT, a Phased Error Correction and Assembly Tool, for reconstructing diploid genomes from long noisy reads. We design a haplotype-aware error correction method that can retain heterozygote alleles while correcting sequencing errors. We combine a corrected read SNP caller and a raw read SNP caller to further improve the identification of inconsistent overlaps in the string graph. We use a grouping method to assign reads to different haplotype groups. PECAT efficiently assembles diploid genomes using Nanopore R9, PacBio CLR or Nanopore R10 reads only. PECAT generates more contiguous haplotype-specific contigs compared to other assemblers. Especially, PECAT achieves nearly haplotype-resolved assembly on B. taurus (Bison×Simmental) using Nanopore R9 reads and phase block NG50 with 59.4/58.0 Mb for HG002 using Nanopore R10 reads.


Introduction
De novo genome assembly is a fundamental task in genomic research [1][2][3] . Using long-reads from sequencing technologies, such as single-molecule real-time (SMRT) sequencing of Pacific Biosciences (PacBio) and Nanopore sequencing of Oxford Technologies (ONT), many assemblers now can effectively reconstruct high-quality genome sequences for haploid or inbred species. However, many genetic information of a diploid genome is lost in those assemblies and most of them have lots of haplotype switch errors. Because long noisy reads such as PacBio CLR reads and Nanopore reads usually contain a 5-15% sequencing error rate 4 , it is difficult to distinguish heterozygotes from sequencing errors in long noisy reads, which prevents the diploid assemblers from generating long haplotype-specific contigs. Recently, by combining high accurate PacBio HiFi reads 5 (≤ 1% error rate) with additional sequences, such as parental short reads 6 , Hi-C reads 7,8 , Strand-seq data 9 , or gamete cell data 10 , assemblers now can produce a more contiguous haplotype-specific contigs 7,11,12 . However, the requirements of additional sequencing increase costs and limit their applications in practice. Furthermore, the average lengths of long noisy reads are longer than those of HiFi reads (10-25 kb). For example, the ultra-long reads of Nanopore are up to 1M in length and with read N50 >100 kb. Longer reads can help to assemble more contiguous contigs. Therefore, it will be useful to develop an assembler for diploid genomes using long noisy reads only, especially for species with high and moderate heterozygosity.
To achieve high-quality assemblies, error correction is usually a useful step for genome assembling using long noisy reads. One challenge to assembling a diploid genome using long noisy reads is how to retain heterozygotes during sequence error correction. Current correct-then-assemble pipeline [13][14][15][16] eliminated heterozygotes as sequencing errors, or mixed alleles of different haplotypes in a read. Therefore, corrected reads don't contain heterozygote information for haplotype phasing.
Assemblers, such as FALCON-Unzip 13 , used raw reads instead of corrected reads in their "unzip" phasing step, then mapped the raw reads to the corrected reads for the later diploid assembly. This has led to assembly errors due to the loss of heterozygotes in the corrected reads. Furthermore, phasing raw reads is more difficult due to their high error rate.
Another challenge in the phasing step is to partition reads according to their haplotypes. After error correction, there still are sequencing errors left in corrected reads. And the error profile of corrected reads is more complex than that of PacBio HiFi reads. In many parts of the diploid genome, the sequencing error of corrected reads is still close to or even exceeds haplotype divergence. This leads to inconsistent overlaps 11 among reads, whose reads come from different haplotypes or different copies of the segmental duplication. The inconsistent overlaps can cause haplotype switch errors or unresolved repeats (Supplementary Figure 1). In our tests, 13.60-36.78% of edges in simplified overlap graphs are inconsistent overlaps during assembling diploid genomes.
Identifying and removing inconsistent overlaps is important in overlap-graph-based diploid genome assembly, even if we assemble the genome using PacBio HiFi reads 11 .
The false-positive SNPs in corrected reads make identifying inconsistent overlaps more difficult. It is necessary to design a more accurate and robust method to identify inconsistent overlaps for long noisy read-based assemblers.
Furthermore, to construct two haplotypes of diploid genomes, the assemblers often need more sequencing data. However, with the increasing amount of sequencing data, the running time of assembly increases nonlinearly. Therefore, assembling large diploid genomes at an acceptable time is another challenge. Overlap-graph-based assemblers have their unique advantages to assemble diploid genomes since the overlaps can be reused in subsequent steps once they are found. This helps to design an efficient multiround assembly strategy. The overlap-graph-based assemblers usually use seed (i.e., kmers, minimizers) based methods to find candidate overlaps, then perform local alignment to find the true overlaps. The local alignment is the major computational bottleneck of overlap-graph-based approaches. Although skipping local alignment can accelerate the overlap finding step 17 , it leads to lots of false-positive overlaps and introduces errors in assembly graphs. String-graph-based 18 approaches, a type of overlap-graph-based approach, ignore the reads contained by other reads and mark most edges in the graph as transitive edges 18 that don't contribute to the construction of contigs. Then, it is not necessary to perform local alignment on those read pairs. Therefore, it is possible to design a fast string-graph-based diploid genome assembler by minimizing the number of local alignments needed.
In this work, we present PECAT, a phased error correction and assembly tool designed to reconstruct diploid genomes from long noisy reads, including PacBio CLR reads and Nanopore reads. PECAT follows the correct-then-assemble strategy, including an error correction module and a two-round string graph-based assembly module (Figure 1). We design a haplotype-aware error correction method for long noisy diploid reads, which can retain heterozygote alleles while correcting sequencing errors. In the first round of assembly, we assemble the corrected reads and generate haplotype-collapsed contigs. In the second round of assembly, we identify and remove the inconsistent overlaps and assemble the corrected reads to -prmary contigs (long contigs with the mosaic of homologous haplotypes) and alternate contigs (short haplotype-specific contigs). We propose a read-level SNP caller and a read grouping method for identifying inconsistent overlaps, which can effectively eliminate the effects of sequencing errors and false-positive SNPs. To accelerate the assembling, PECAT only performs local alignment when it is necessary instead of performing local alignments on all candidate overlaps. PECAT can efficiently assemble diploid genomes using only long noisy reads and generate more contiguous haplotype-specific contigs compared to other assemblers.

Haplotype-aware error correction
Our error correction method is based on the partial-order alignment (POA) graph 19 method. For the template read to be corrected, a POA graph is built from the alignment of supporting reads. Then, the path with the highest weight is found in the POA graph to construct the consensus sequence for the template read. However, if the sequencing error rate exceeds haplotype divergence, current methods inevitably select supporting reads from different haplotypes. This causes either heterozygote alleles in the template reads to be eliminated as sequencing errors or heterozygous alleles from different haplotypes to be mixed in corrected reads.
To maintain the heterozygote alleles in the template read, we need to select supporting reads from the same haplotype to correct it. After analyzing the POA graph, we have found that the difference between random sequencing errors and heterozygotes can be reflected in the POA graph. In the case of heterozygotes, there are two dominant parallel branches in the POA graph. In the case of random errors, there tends to be only one dominant branch. Based on this finding, we design a scoring algorithm to estimate the likelihood that the supporting read and the template read are from the same haplotype (Methods). For each position at which there are two dominant parallel branches, if the supporting read and the template read pass through the same dominant branch, the score is increased by 1 and if the supporting read and the template read pass through the different dominant branches, the score is decreased by 1. A higher score means that the template read and the supporting read are more likely from the same haplotype. If the supporting reads come from different haplotypes, the histogram of scores of all supporting reads should show two peaks (Supplementary Figure. 2

c).
Then, we select the high-scoring supporting reads, which are very likely to be in the same haplotype with the template read, to correct the template read. To further increase the likelihood of selecting supporting reads in the same haplotype, we assign different weights to the reads according to their score. The higher the score of a read, the larger its weight is assigned (Methods). Then, we remove unselected reads in the POA graph by assigning their weights to 0 and find the path with the highest weight in the POA graph using dynamic programming and backtracking.
We evaluate the performance of the selection method on the diploid datasets with reads classified by the trio-binning algorithm 6 (Methods). As shown in and 8.5% inconsistent reads in selected supporting reads, respectively. After further reweighting the scores, the percentages of inconsistent reads are reduced to 3.3%, 6.7%, 4.8% and 6.6%, respectively. For the Nanopore datasets of A. thaliana (COL×C24) and B. taurus (Bison×Simmental), we implement two rounds of error correction and the selection method outputs similar results. After re-weighting the scores, percentages of inconsistent reads in selected reads are 5.4% and 4.3% in the first round of error correction and 3.6% and 1.2% in the second round of error correction.

Supplementary
Since most of the selected supporting reads are from the same haplotype of the template read, most heterozygote alleles in the template read are correctly retained in our error correction method. We compare our method with other methods including  Table 2). The accuracy of SNP alleles in reads corrected by PECAT is 98.62-99.95%, while the accuracies of SNP alleles in reads corrected by other methods are even worse than that of raw reads.

Fast string graph-based assembler
After error correction, PECAT implements two rounds of string graph-based assembly. In each round of assembly, we first construct the overlaps between the corrected reads using the seed-based alignment method (minimap2 20 ), which allows us to build the overlaps quickly. However, seed-based alignment can bring low-quality overlaps with low identity or with long overhangs. Those low-quality overlaps could introduce errors during assembling. Performing local alignment on overlaps to identify low-quality ones becomes the major computational cost. To speed up the assembling, PECAT only performs local alignments when it is necessary during the construction of  Table 3). Then, we further filter out the overlaps whose reads are contained in other reads or with low coverage. Only 0.2%, 0.2%, 0.8%, 0.7%, 0.1% and 0.4% of overlaps are remained on the above six diploid datasets (Supplementary Table 3). We then construct a directed string graph from the remaining overlaps. We find the transitive edges using Mayer's algorithm 18 Table   3).
To remove low-quality edges in the string graph, we calculate the identity of the overlaps (the active edges) in the string graph using local alignments. Since most edges have been marked as inactive, we only need to perform local alignments for a small portion of overlaps. On the above diploid datasets, 1.5%, 15.0%, 13.2%, 2.8%, 17.4%, and 7.4% of active edges have been removed because their identities are less than the threshold (0.95 by default) (Supplementary Table 3). After the above step, some paths in the graphs connected by low-quality edges are broken. And those broken paths need to be connected using the transitive edges that have been labeled inactive edges. We then select transitive edges with the longest alignment and their identity greater than the threshold to connect the broken paths. On the above diploid datasets, about 0.4%, 0.3%, 0.2%, 0.1%, 0.3%, and 0.1% of transitive edges have been reactivated (Supplementary Table 3). In the first round of assembly (Figure 1b), PECAT finds linear paths from this string graph and constructs haplotype-collapsed contigs. In the second round of assembly (Figure 1c), PECAT identifies and removes inconsistent overlaps (next Section) in the string graph, and then generates primary and alternate contigs.

Identification of inconsistent overlaps
The inconsistent overlaps connect reads from different haplotypes or different copies of the segmental duplication in the overlap graph. After identifying and filtering out inconsistent overlaps, only the overlaps between reads from the same haplotype or the same copies of the segmental duplication are left, and then the assembler can naturally generate contigs of two haplotypes. Since our corrected reads contain allele information, we may use the SNP allele information to identify inconsistent overlaps.
If the SNP alleles on a pair of reads are different, these two reads should come from different haplotypes and the overlaps between them should be inconsistent. At the error correction step, we have scored the possibility that two raw reads come from the same haplotype. However, the accuracy of the previous scoring that is based on the alignments between two raw reads cannot meet the requirements for identifying inconsistent overlaps. Therefore, we develop a read-level SNP caller and a read grouping method using our corrected reads for identifying inconsistent overlaps.
First, we map all corrected reads onto the haplotype-collapsed contigs from the first round of assembly using minimap2 20 . We call the heterozygous SNP sites based on the base frequency of the alignments (Methods). Then, we set each read as a template read and collect a set of reads that have the common SNP sites with it. We cluster the set of reads according to the SNP alleles in reads. The reads from the same haplotype are more likely to be clustered into the same subgroup. We then verify and correct the SNP alleles in the template read using other reads in the same subgroups.
After identifying SNP alleles in each read, we remove the inconsistent overlaps using the SNP information from the candidate overlaps found in the first round of assembly.
Then, we construct the string graph again.
Our method can effectively eliminate the effects of sequencing errors and falsepositive SNPs and improve the accuracy of identifying inconsistent overlaps. We evaluate the performance of the inconsistent overlaps identification method using the haplotype reads classified by the trio-binning algorithm 6 . As shown in Supplementary respectively. Furthermore, compared with the first round of assembly, the percentages of inconsistent overlaps in simplified graphs of the second round of assembly decrease sharply. For example, for S. cerevisiae (SK1×Y12), the percentage of inconsistent overlaps has decreased from 36.78% in first round of assembly to 0.3% in second round of assembly; and for B. taurus (Angus×Branman), the percentage of inconsistent overlaps has decreased from 33.92% in first round of assembly to 1.01% in second round of assembly. The high performance of the inconsistent overlap identification method ensures that PECAT can generate the haplotype-specific contigs effectively.
Moreover, our inconsistent overlap identification method can also help solve nearly identical repeats without knowing the number of their copies. The clustering step can automatically separate nearly identical repeats if there are SNPs that can divide the repeats. After filtering out the inconsistent overlaps at the repeats, PECAT can solve the repeat to generate contiguous contigs in the second round of assembly. As a result, PECAT achieves more contiguous assemblies in the second round of assembly on most of our diploid datasets (Supplementary Table 5). As shown in Supplementary   Figures 3,4, PECAT perfectly solves repeats in the NCTC9024 and NCTC9006 datasets and reconstructs two circle contigs, while other assemblers obtain fragment results 22,23 .

Performance of PECAT error correction method
We evaluate the performance of the PECAT error correction method using four  16 . We first assess the quality of corrected reads with respect to the accuracy, maximum length, and N50 of 40X longest corrected reads. Table 1, all methods have similar performance on these four metrics. We then assess the haplotype-specific k-mers completeness and consistency of corrected reads (Methods). These two metrics can evaluate the ability to retain consistent heterozygote alleles. Table 1 shows that reads corrected by PECAT have higher consistency and haplotype-specific k-mers completeness in all datasets. Especially, haplotype-specific k-mers consistencies of reads corrected by PECAT are greater than or equal to 98.4% on all datasets, while haplotype-specific k-mers consistencies of reads corrected by other methods are all less than or equal to 94.6%. This demonstrates the advantages of PECAT for correcting diploid sequences. The reads corrected by PECAT contain more haplotype-specific k-mers than those corrected by other methods (Figure 2 and Supplementary Figures 5-9). Especially, PECAT effectively avoids mixing heterozygous alleles from two haplotypes and the corrected reads tend to contain only one type of haplotype-specific k-mer (Figure 2 and Supplementary   Figures 5-9).

Performance of PECAT assembler
We also assess the performance of the PECAT assembler using four PacBio diploid We first evaluate assembled genomes with respect to the assembly size, number of contigs, max length, and NG50 of primary and alternate contigs. As shown in Table 2 which is assembled using trio-binning methods using paternal reads and Hi-C reads 27 .
Then, we evaluate the quality of assembled genomes using BUSCO 28 and the "Quality", "Haploblock NG50" and "Intra-block switch error" from merqury 29  Furthermore, as shown in Figure 3 and Supplementary Figures 10-14

Discussion
Currently, directly using long noisy reads for diploid genome assembly remains a challenge because it is difficult to distinguish haplotype differences from sequencing errors. In this study, we develop PECAT to overcome sequencing errors in long noisy reads and construct the diploid genome. The PECAT includes a haplotype-aware error correction method, a read-level SNP caller and a read grouping method. The haplotypeaware error correct method has kept most of the heterozygote alleles while correcting sequencing errors. The read-level SNP caller further reduces the SNP errors in corrected reads. And the read grouping method assigns reads to different haplotype groups. Those three methods together allow PECAT to eliminate the impact of sequencing errors and identify inconsistent overlaps, and then generate more contiguous haplotype-specific contigs with fewer errors. Our experimental results show that PECAT is an efficient error correction and assembly pipeline for diploid genomes.
In addition, the read grouping method can also help separate reads from nearly identical repeats with multiple copies. By assign reads into different group without knowing the number of their copies, the read grouping methods can identify inconsistent overlaps between reads from those nearly identical repeats in genome, which then help solve those repeats during assembly. Previous methods, such as FALCON-Unzip, have assumed that there are only two copies of reads in collapsed contigs, and then phases the reads using heterozygous SNPs information in local blocks to separate collapsed contigs into two contigs only, which cannot solve the repeats with more than two copies.

Diploid datasets for benchmarking
We evaluate the performance of PECAT using six diploid datasets (Supplementary

Haplotype-aware error correction
PECAT error correction method is based on the partial-order alignment (POA) graph method 13,30 . Instead of assigning the same weight to each read for error correction, we select reads more likely coming from the same haplotype or the same copy of the nearly identical repeat and assign different weights to prevent heterozygotes in reads from being eliminated as sequencing errors. First, we find candidate overlaps between raw reads using minimap2 with parameters "-x ava-pb" for PacBio CLR reads and parameters "-x ava-ont" for Nanopore reads. For each template read to be corrected, we collect a group of supporting reads that have candidate overlaps with the template read.
Then, we perform pairwise alignment between the template read and each supporting read using diff 21

Read-level SNP caller and read grouping method for identifying inconsistent overlaps Calling heterozygous SNPs in haplotype-collapsed contigs and SNP alleles in reads
We map corrected reads to the first round of assembly using minimap2 with parameters "-x map-pb -c -p 0.5 -r 1000" for PacBio reads and parameters "-x map-ont -c -p 0.5 -r 1000" for Nanopore reads. It performs base-level alignment and generates CIGAR strings. We scan the CIGRA strings and call heterozygous SNPs for each contig.
We call a base site as a heterozygous SNP site if it meets the following two conditions.

Verifying and correcting SNP alleles
To verify and correct SNP alleles in a read, we need to find which reads are from the same haplotype within its local region. For each read labeled as the template read , we assume that it covers a set of SNP sites = { 1 , 2 , … , }. We collect the query reads that cover 3 common SNP sites with the template read . The template read and the query reads are put into a group . We cluster the reads in the group according to the SNP alleles in them. The reads in the same cluster can be considered from the same haplotype. To facilitate the description of the method, here we make some definitions.
We define a centroid = {( , )| ∈ } for the group at the SNP site set . is defined to + − − , where + is the number of the read set { | ( ) = 1, ∈ , ∈ } and − is the number of the read set { | ( ) = 2, ∈ , ∈ }. is defined to + + − . One read can be regarded as a group only including it. We define the following three formulas to get verified SNP sites of the centroid . The parameters 0 and 1 are used to control which sites are valid. Next, we define the operations between two centroids.
( 1 , 2 , 0 , 1 ) and ( 1 , 2 , 0 , 1 ) are used to describe the similarity and the distance of the centroids 1 and 2 . A read can also be used as the parameter, which means the centroid of { }.
For each template read and related group , we use a divide-then-combine strategy to cluster reads (Supplementary Figure 15 b). In the dividing step, we use a modified bisecting k-means algorithm 32  3. We calculate the centroids for the first two subgroups and repeat step 2 until the three subgroups don't change.
After the group is divided into three subgroups, we repeat the above steps to continue dividing the subgroups. If the subgroup and its centroid meet the After combining the subgroups, we think the reads in the same subgroup in the list are from the same haplotype or the same copy of the repeat. The SNP alleles in the read can be verified by the centroid of its subgroup, which helps to find inconsistent overlaps more accurately. In addition, since the group 1 contains the template read , we use the centroid of the group 1 to correct SNP alleles in the template read . Here, the read is corrected when it is treated as a template read. After all reads are corrected, we run the verifying and correcting SNP alleles step again to obtain more robust results.
We run the step three times by default.
The parameters 6 , 7 , 8 and 9 are mentioned above. The overlaps between the inconsistent read pairs are identified as inconsistent overlaps. We record the inconsistent overlap information and SNP alleles in reads for subsequent steps.

Fast string-graph-based assembler
According to the characteristics of k-mer-based alignment and string graphs, we propose a fast string-graph-based assembler to balance the quality and speed of assembling. First, we use minimap2 with parameters "-X -g3000 -w30 -k19 -m100 -r500" to find candidate overlaps between corrected reads. Minimap2 20 with those parameters invokes the k-mer-based alignment. To reduce overhangs of overlaps, we extend the alignment to the ends of the reads with the diff 21 algorithm and filter out overlaps still with long overhangs. Next, we remove the overlaps whose reads are contained in other reads or with low coverage. A directed string graph is constructed using the remaining overlaps. Mayer's algorithm 18 is used to mark transitive edges as inactive ones. We implement local alignment using the edlib 31 algorithm to calculate the identity of each active edge, which is defined as the identity of its related overlap.
Only a few edges need to calculate identities for most of the edges are marked as inactive ones. The edges whose identities are less than the threshold (default value is 0.95) are marked as low quality and removed. This step also breaks some paths connected by low-quality edges in the graphs. To repair those paths, we check deadend nodes whose outdegree or indegree are equal to 0. We calculate the identities of their transitive edges and reactivate the edge with the longest alignment and the identity greater than the threshold. In this way, an appropriate string graph is constructed. After performing other simplifying processes, such as removing the ambiguous edges (tips, bubbles, and spurious links) in the graph, we identify linear paths from the graph and generate contigs.

Improvement of best overlap graph algorithm
Best overlap graph algorithm 33  Then, if the node has no best in-edge or best out-edge, its first in-edge or first out-edge is selected as the best one. Finally, the edges selected as the best edges are retained and the other edges are removed from the string graph.

Generating primary and alternate contigs
After simplifying the string graph, PECAT identifies three structures from the string graph, as shown in Supplementary Figure 16.

Polishing primary and alternate contigs
After generating primary and alternate contigs, we use corrected reads or raw reads to polish them. We map the reads to the primary and alternate contigs using minimap2.
In previous steps, we record which read pairs are inconsistent and which reads construct the contigs. If the read is inconsistent with the reads used for constructing the contig, the related alignments are removed. This trick helps to reduce haplotype switch errors.
After removing other low-quality alignments, we run racon 34 with default parameters to polish the contigs.

Evaluation
To evaluate the effectiveness of correction methods fairly, we extract the 40X longest reads from corrected reads and then evaluate them. We map the reads to the reference genome using minimap2 with parameters "-c --eqx". It generates CIGAR strings. We scan the CIGAR strings to calculate the accuracy of each corrected data. To evaluate the haplotype-specific k-mers completeness, we calculate the percentage of parentspecific k-mers in 40X longest reads. Considering that there are 40X datasets, we only count k-mers with equal to or more than 4 occurrences. To evaluate the haplotypespecific k-mers consistency, we calculate the metric as ∑max( , )/∑( + ), in which and are the number of paternal and maternal haplotype-specific k-mers in each read.
To evaluate the performance of selecting supporting reads and the performance of finding inconsistent overlaps, we use Illumina reads of parents to classify long reads with a trio-binning algorithm 6  . Otherwise, it is classified to the untagged reads. When one read of the pair of reads is a maternal read and the other read is a paternal read or vice versa, we think the pair of reads is inconsistent.
We use k-mer based assembly evaluation tool merqury 29

Accession codes
All source codes for PECAT are available from https://github.com/lemene/PECAT.  The first round of assembly. PECAT finds overlaps between reads. The dashed lines indicate there are overlaps between reads. Our string-graph-based assembler is performed to construct haplotype-collapsed contigs. (c) The second round of assembly. PECAT identifies inconsistent overlaps by calling and analyzing SNP alleles in reads. After removing inconsistent overlaps from the overlaps found in the first round of assembly, PECAT performs our string-graphbased assembler again to construct primary/alternate contigs. Figure 2. Haplotype-specific k-mers consistency and completeness of D. melanogaster (ISO1 × A4) raw reads and corrected reads by the different methods. (a) Each point corresponds to a read. Its coordinate gives proportion of the parental specific k-mers in the read. There are 1000 randomly selected reads from 40X longest reads for each sub figure. (b, c) The percentages of the parental specific k-mers whose occurrences are greater or equal to 1-10 in 40X longest reads.